Training topic classifiers for conversational speech with limited data
Authors
Abstract
In this paper we demonstrate how automatically generated transcriptions can be used to develop an effective topic classification application. Two key contributions of our work are (a) investigating the impact of unsupervised transcriptions on topic classification where the transcription system has been trained with very limited amounts of data, and (b) demonstrating the use of mixture language models that significantly improve topic classification performance.
Similar resources
Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures
Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.
Confidence-Based Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech
We investigate the impact of automatic speech recognition errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature-weighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice outputs using one reco...
Using conversational word bursts in spoken term detection
We describe a language-independent word burst feature based on the structure of conversational speech that can be used to improve spoken term detection (STD) performance. Word burst refers to a phenomenon in conversational speech in which particular content words tend to occur in close proximity to each other as a byproduct of the topic under discussion. To take advantage of bursts, we describe...
A Boosting Approach to Topic Spotting on Subdialogues
We report the results of a study on topic spotting in conversational speech. Using a machine learning approach, we build classifiers that accept an audio file of conversational human speech as input, and output an estimate of the topic being discussed. Our methodology makes use of a well-known corpus of transcribed and topic-labeled speech (the Switchboard corpus), and involves an interesting do...
Class-dependent Interpolation for Estimating Language Models from Multiple Text Sources
Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.